Three-Dimensional Speaker Localization: Audio-Refined Visual Scaling Factor Estimation

Abstract

Neither a monocular RGB camera nor a small-size microphone array is capable of accurate three-dimensional (3D) speaker localization. By taking advantage of visual object detection and audio-visual complementary sensor fusion, we formulate the 3D speaker localization problem as a scaling factor estimation problem. As a result, we effectively reduce traditional audio-only 3D speaker localization from an exhaustive 3D grid search to a one-dimensional (1D) optimization problem. We propose a multi-modal perception system with two approaches. We show that the proposed methods are effective, accurate, robust against interference and, as corroborated by indicative empirical results on a real dataset, competitive with conventional uni-modal state-of-the-art methods.
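The reduction from a 3D grid search to a 1D search can be illustrated with a minimal sketch. This is not the authors' method, which the abstract does not specify. It assumes that a visual detector supplies a unit bearing ray from the camera to the speaker, that the camera sits at the origin next to microphone 0, and that the audio side supplies time differences of arrival (TDOAs). The array geometry, the search bounds, the least-squares cost, and every function name here are hypothetical choices made for illustration. Under those assumptions the only unknown is the scale along the ray, so a scalar search is enough.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def tdoas(source, mics):
    """Time differences of arrival (s) at mics[1:] relative to mics[0]."""
    d0 = math.dist(source, mics[0])
    return [(math.dist(source, m) - d0) / SPEED_OF_SOUND for m in mics[1:]]


def cost(scale, ray, mics, measured):
    """Squared TDOA mismatch for the candidate point scale * ray."""
    point = tuple(scale * c for c in ray)
    return sum((p - q) ** 2 for p, q in zip(tdoas(point, mics), measured))


def estimate_scale(ray, mics, measured, lo=0.3, hi=6.0, coarse=200, iters=60):
    """Coarse 1D scan over [lo, hi], then golden-section refinement."""
    f = lambda s: cost(s, ray, mics, measured)
    step = (hi - lo) / coarse
    # Pick the best grid point and bracket the minimum by one step each side.
    best = min(range(coarse + 1), key=lambda i: f(lo + i * step))
    a = max(lo, lo + (best - 1) * step)
    b = min(hi, lo + (best + 1) * step)
    inv_phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2


# Hypothetical setup: camera at the origin, co-located with microphone 0.
mics = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1)]
ray = unit((0.3, 0.2, 1.0))   # bearing taken from a visual detection
true_scale = 2.0              # simulated distance along the ray, in metres
measured = tdoas(tuple(true_scale * c for c in ray), mics)  # noiseless TDOAs
estimated = estimate_scale(ray, mics, measured)
print(f"estimated range along the ray: {estimated:.3f} m")
```

With a compact array the TDOAs change only slowly with range, so on noisy measurements the estimate along the ray would be far less precise than in this noiseless simulation.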

Similar resources

Audio-Visual Clustering for Multiple Speaker Localization

We address the issue of identifying and localizing individuals in a scene that contains several people engaged in conversation. We use a human-like configuration of sensors (binaural and binocular) to gather both auditory and visual observations. We show that the identification and localization problem can be recast as the task of clustering the audio-visual observations into coherent groups. W...

Audio-visual speaker localization for car navigation systems

Human-computer interaction for in-vehicle information and navigation systems is a challenging problem because of the diverse and changing acoustic environments. It is proposed that the integration of video and audio information can significantly improve dialog system performance, since the visual modality is not impacted by acoustic noise. In this paper, we propose a robust audio-visual integra...

AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking

Assessing the quality of a speaker localization or tracking algorithm on a few short examples is difficult, especially when the groundtruth is absent or not well defined. One step towards systematic performance evaluation of such algorithms is to provide time-continuous speaker location annotation over a series of real recordings, covering various test cases. Areas of interest include audio, vi...

Two- and Three-Dimensional Audio-Visual Speech Synthesis

An audio-visual speech synthesiser has been built that will generate animated computer-graphics displays of high-resolution, colour images of a speaker's mouth area. The visual displays can simulate the movements of the lower face of a talker for any spoken sentence of British English, given a text input. The synthesiser is based on a data-driven technique. It uses encoded, video-recorded image...

Audio-visual speaker conversion using prosody features

The article presents a joint audio-video approach towards speaker identity conversion, based on statistical methods originally introduced for voice conversion. Using the experimental data from the 3D BIWI Audiovisual corpus of Affective Communication, mapping functions are built between each two speakers in order to convert speaker-specific features: speech signal and 3D facial expressions. The...

Journal

Journal title: IEEE Signal Processing Letters

Year: 2021

ISSN: 1558-2361, 1070-9908

DOI: https://doi.org/10.1109/lsp.2021.3092959